Abstract: String matching is the most computation-intensive process in many applications, such as network intrusion
detection systems, web searching, and biological matching. The Aho-Corasick algorithm is the most popular string matching
algorithm because a single thread can match all patterns simultaneously. In our previous work, we proposed a string
matching algorithm called the Parallel Failureless Aho-Corasick algorithm, which parallelizes the traditional Aho-Corasick
algorithm by running multiple threads on graphics processing units. Motivated by the advancing technology of multi-core
processors, in this paper we accelerate the Parallel Failureless Aho-Corasick algorithm on multi-core processors using a
multi-threaded implementation. Experimental results show that, for large-scale inputs and pattern sets, the Parallel
Failureless Aho-Corasick algorithm running on multi-core processors delivers throughput of up to 33 Gbps, 4 times faster
than the traditional multi-threaded Aho-Corasick algorithm. Both the performance and the scalability of the Parallel
Failureless Aho-Corasick algorithm are improved on multi-core processors.

I. Introduction

String matching is the most computation-intensive process in many applications, such as network intrusion detection systems,
web searching, and biological matching. For example, in signature-based network intrusion detection systems (NIDS), string
matching is used to inspect packet payloads against thousands of attack patterns in order to find malicious packets. To
accommodate the ever-increasing number of attack patterns and keep up with high-speed network communication, accelerating
string matching has become a critical issue.

Due to the development of multi-core central processing units (CPUs) and graphics processing units (GPUs), many studies
[2-9] have proposed parallelizing string matching algorithms on these multi-core machines. Among these string matching
algorithms, the Aho-Corasick (AC) algorithm [1] is widely adopted because it can search for multiple patterns
simultaneously. In our previous works [2] [3], we modified the AC algorithm and proposed a parallel algorithm called the
Parallel Failureless Aho-Corasick (PFAC) algorithm to accelerate string matching on GPUs. We have also
released an open source library called PFAC [11] on Google Code. The PFAC algorithm running on NVIDIA GPUs
achieves remarkable performance improvement over the AC algorithm. However, because string matching is a memory-intensive
application, the overhead of communication between the CPU and the GPU has a significant impact on the performance of
the PFAC algorithm on GPUs. In addition, the limited memory capacity of GPUs is another critical issue when processing big
data.

The advancing technology of transistor integration is producing increasingly powerful multi-core processors. For example, the
Intel® Xeon® E7-8870 [15] processor contains up to ten cores per processor, which provide up to twenty threads per
processor with Intel® hyper-threading technology. The Xeon E7 processors feature up to 30 MB of L3 cache, which can be
dynamically shared by all cores, and support up to 4,096 GB of memory. In addition, Amazon Elastic Compute Cloud
(Amazon EC2) [14] provides a high-memory cluster instance which features up to 244 GB of memory on Xeon processors.
Because these processors avoid the overhead of PCIe communication, we would like to evaluate the performance of the PFAC
algorithm on them.

In this paper, we accelerate the PFAC algorithm on multi-core processors using the OpenMP [12] library and evaluate the
performance and scalability of the PFAC algorithm on multi-core processors. The results are compared with a multi-threaded
Aho-Corasick algorithm. For processing 16 GB inputs with 10K patterns, the PFAC algorithm achieves up to 33 Gbps
throughput on the Intel® Xeon® processors, a 4-fold performance improvement over the multi-threaded AC
algorithm.

II. Review of Parallel Failureless Aho-Corasick Algorithm

Before we review the Parallel Failureless Aho-Corasick algorithm, we first introduce a data parallel approach adopted by
many studies [4] [5] [6] [7] [8] [9] to parallelize the Aho-Corasick algorithm, referred to as the Data Parallel Aho-Corasick
(DPAC) approach. The DPAC approach first uses the traditional AC algorithm to compile the string patterns into a finite state
machine, whose state transition table is called the AC state transition table. Then, the DPAC approach divides the input
string into multiple segments and assigns each segment an individual thread, which performs string matching by traversing
the AC state transition table. The DPAC approach has a well-known boundary detection problem: a pattern located across a
segment boundary cannot be found. To resolve the boundary detection problem, each thread of DPAC must scan across the
boundary for an additional length equal to the longest pattern length minus one. In other words, each thread of DPAC has a
constant duration time, O(s+m-1), where s is the segment size and m is the longest pattern length. Fig. 1 shows the AC state
machine for matching the three patterns “AABA”, “ABA”, and “BAB”. In Fig. 1, the solid lines represent valid transitions for
specific input characters, while the dotted lines represent failure transitions, which are taken when no valid transition
exists for an input character. Fig. 2 shows the DPAC approach, where the input string is divided into multiple segments and
each segment is assigned a thread to traverse the AC state machine. Because the length of the longest pattern, “AABA”, is
four, each thread has to scan across the boundary for three characters to resolve the boundary detection problem. In the
example, thread #3 will find the pattern “AABA” occurring across the boundary of segments #3 and #4. In Section IV, we will
evaluate the relationship between the performance of DPAC, the number of threads, and the scheduling policy in order to
determine how to achieve the maximum performance on multi-core processors.




Figure 2. Each thread has constant duration time equal to the segment size plus the longest pattern length minus one.
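
The DPAC approach described above can be sketched as follows. This is a minimal, sequential Python illustration rather than the multi-threaded implementation evaluated in this paper: the function names (build_ac, ac_scan, dpac) are hypothetical, each loop iteration in dpac stands in for one thread, and match positions are zero-based. Each "thread" scans its own segment plus m-1 extra characters and keeps only the matches that begin inside its own segment, so a match is never reported twice.

```python
from collections import deque

def build_ac(patterns):
    """Build an Aho-Corasick automaton: valid transitions, failure links, outputs."""
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    queue = deque(goto[0].values())          # breadth-first over the trie
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            f = fail[s]
            while f and ch not in goto[f]:   # follow failure links of the parent
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]           # inherit patterns ending at the suffix
            queue.append(t)
    return goto, fail, out

def ac_scan(text, start, stop, own_end, automaton):
    """Scan text[start:stop]; keep only matches that begin before own_end."""
    goto, fail, out = automaton
    s, hits = 0, set()
    for i in range(start, stop):
        ch = text[i]
        while s and ch not in goto[s]:       # failure transitions
            s = fail[s]
        s = goto[s].get(ch, 0)
        for p in out[s]:
            begin = i - len(p) + 1
            if begin < own_end:
                hits.add((begin, p))
    return hits

def dpac(text, patterns, segment_size):
    """Split the input into segments; each segment is scanned independently."""
    automaton = build_ac(patterns)
    m = max(len(p) for p in patterns)        # longest pattern length
    hits = set()
    for start in range(0, len(text), segment_size):   # one "thread" per segment
        own_end = min(start + segment_size, len(text))
        stop = min(own_end + m - 1, len(text))        # scan m-1 past the boundary
        hits |= ac_scan(text, start, stop, own_end, automaton)
    return sorted(hits)
```

For the patterns of Fig. 1 and the input "AAAAAABA", dpac finds "AABA" at offset 4 and "ABA" at offset 5 regardless of the segment size, because the overlap of m-1 = 3 characters covers every boundary.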

To improve the efficiency of DPAC, the PFAC [2] [3] algorithm was proposed to accelerate string matching using multiple
threads on GPUs. Unlike DPAC, the PFAC assigns each input byte an individual thread, as shown in Fig. 3. Each
thread in PFAC is only responsible for finding patterns that begin at its starting position. Therefore, all failure
transitions can be removed and only valid transitions exist, as shown in Fig. 4. The compact state machine is referred to as
the Failureless-AC state machine. Whenever a thread cannot find a valid transition for an input character, the thread
terminates immediately without taking a failure transition. For example, Fig. 3 shows the status of each thread, where the
activated states are marked in red. In Fig. 3, threads #5 and #6 reach final states and find the patterns “AABA” and “ABA”,
respectively. Except for threads #5 and #6, the other threads terminate in the early stages. Unlike in the DPAC algorithm,
the duration time of a thread varies from O(1) to O(m), where m is the longest pattern length. Table 1 summarizes the time
complexity of a thread in the AC, DPAC, and PFAC algorithms, where n, s, and m represent the input length, the segment size,
and the longest pattern length, respectively. Compared to the AC and DPAC algorithms, the PFAC theoretically has the best
per-thread time complexity.

Furthermore, each final state in the PFAC state machine represents exactly one pattern. For example, in Fig. 4, state 4
represents only the final state of pattern “AABA”, while in Fig. 1, state 4 represents the final states of both “AABA” and
“ABA”. Based on this property, we can use state encoding to represent final states, so the output table can be removed.
Because the output table is eliminated, the performance of PFAC can be further improved.
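
The failureless matching and the state encoding described above can be sketched as follows. This is a minimal, sequential Python illustration under stated assumptions, not the PFAC library code: the function names are hypothetical, patterns are assumed to be distinct and non-empty, and the encoding chosen here (final states take the IDs 1..k, so the state ID itself identifies the pattern) is one possible encoding that makes an output table unnecessary. Each iteration of the outer loop in pfac plays the role of one thread.

```python
def build_failureless(patterns):
    """Build a trie with valid transitions only; final states get IDs 1..k."""
    # Pass 1: ordinary trie, remembering which node ends which pattern.
    trie, ends = [{}], {}
    for pid, p in enumerate(patterns, start=1):
        node = 0
        for ch in p:
            if ch not in trie[node]:
                trie.append({})
                trie[node][ch] = len(trie) - 1
            node = trie[node][ch]
        ends[node] = pid
    # Pass 2: renumber so that the final state of pattern j has ID j.
    new_id = {0: 0, **ends}
    next_id = len(patterns) + 1
    for node in range(1, len(trie)):
        if node not in new_id:
            new_id[node] = next_id
            next_id += 1
    table = [None] * len(trie)
    for node, edges in enumerate(trie):
        table[new_id[node]] = {ch: new_id[t] for ch, t in edges.items()}
    return table

def pfac(text, patterns):
    """Start one walk per input position; stop at the first missing transition."""
    table = build_failureless(patterns)
    k = len(patterns)
    hits = []
    for start in range(len(text)):            # one "thread" per input byte
        state = 0
        for i in range(start, len(text)):
            state = table[state].get(text[i])
            if state is None:                 # no valid transition: terminate
                break
            if state <= k:                    # final state: ID is the pattern number
                hits.append((start, patterns[state - 1]))
    return hits
```

With the patterns “AABA”, “ABA”, and “BAB” and the input "AAAAAABA", the walks starting at zero-based offsets 4 and 5 reach final states, which corresponds to threads #5 and #6 in Fig. 3; every other walk stops after at most three characters.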


[Figure: threads #1 to #8, one thread per byte of the input string "AAAAAABA"]

Figure 3. Parallel Failureless-AC algorithm



Figure 4. Failureless-AC state machine of the patterns “AABA”, “ABA”, and “BAB” 

The PFAC is well suited to GPUs because GPUs can issue a huge number of threads simultaneously on their hundreds
or thousands of physical cores. However, the overhead of communication between the CPU and the GPU limits the
total system throughput. In addition, the memory capacity of GPUs is another critical issue for processing large-scale data.
Owing to advances in CPU technology, the latest CPUs, such as the Intel® Xeon® E7-8870 processor, feature up to ten cores
per processor, which allow twenty threads per processor with Intel® hyper-threading technology. In addition, the new Xeon
processors have large memory capacity, including 30 MB of L3 cache and 4,096 GB of main memory. With the increasing number
of cores per processor and high memory capacity, performing PFAC on multi-core processors can deliver significant
performance for processing large-scale data.


Table 1

Comparison of matching time complexity

Algorithm                                  Time complexity of a thread
Aho-Corasick (AC)                          O(n)
Data Parallel Aho-Corasick (DPAC)          O(s + m)
Parallel Failureless Aho-Corasick (PFAC)   O(1) to O(m)


III. Parallelization on Multi-core Processors 

In this paper, we evaluate the performance of the PFAC algorithm and the DPAC algorithm on multi-core CPUs using
multi-threaded implementations. Specifically, we put emphasis on the scalability of the PFAC algorithm to accommodate
large-scale inputs and pattern sets. Finally, we evaluate the performance of two types of thread scheduling. The OpenMP [12]
library is adopted to parallelize the DPAC and PFAC algorithms on multi-core CPUs. OpenMP is a multithreading programming
model which allows forking multiple threads to run concurrently on different processors. OpenMP supports multiple
platforms and operating systems, including Linux, Mac OS X, and Windows, and provides a set of compiler
directives, library routines, and environment variables that control run-time behavior.

The OpenMP library provides two major types of scheduling, static scheduling and dynamic scheduling. Static scheduling
divides the iterations evenly among all threads before the threads start executing, while dynamic scheduling hands a small
number of iterations to each thread at a time. When a thread finishes its allocated iterations, the thread returns to get
new iterations. The parameter chunk defines the number of contiguous iterations that are allocated to a thread at a time.
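
The difference between the two policies can be illustrated with a small simulation. The following Python sketch does not use OpenMP; it only models how chunks of iterations would be assigned under clauses such as schedule(static, chunk) and schedule(dynamic, chunk), and reports the finishing time of the slowest thread. The function names and the per-iteration costs are hypothetical, and the model ignores the run-time overhead of handing out each chunk, which is the cost that a larger chunk value reduces in practice.

```python
import heapq

def static_schedule(costs, threads, chunk):
    """Chunks are dealt to threads round-robin before execution starts."""
    busy = [0] * threads
    for c, first in enumerate(range(0, len(costs), chunk)):
        busy[c % threads] += sum(costs[first:first + chunk])
    return max(busy)                      # finishing time of the slowest thread

def dynamic_schedule(costs, threads, chunk):
    """Each thread takes the next chunk as soon as it becomes idle."""
    free_at = [0] * threads               # min-heap of times at which threads go idle
    for first in range(0, len(costs), chunk):
        t = heapq.heappop(free_at)        # the thread that becomes idle first
        heapq.heappush(free_at, t + sum(costs[first:first + chunk]))
    return max(free_at)
```

With four threads, a chunk of 1, and uniform costs [5] * 8, both functions return 10, so nothing is gained by dynamic assignment. With the uneven costs [8, 1, 1, 1, 8, 1, 1, 1], static assignment gives both expensive iterations to the same thread and finishes at time 16, while dynamic assignment finishes at time 9.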

As shown in Table 2, for processing an input of length n, the DPAC needs to fork n/s parallel threads, where s is the segment
size. Because the DPAC divides the input into a small number of segments equal to the number of virtual cores and each
thread has a constant duration time, static scheduling with a chunk value of 1 keeps all virtual cores (threads) working
simultaneously. On the other hand, because the PFAC needs a large number of threads and each thread has a different duration
time, dynamic scheduling with a large chunk value better suits the behavior of the PFAC.


Table 2

Comparison of DPAC and PFAC in terms of scheduling

Algorithm   # of threads   Thread duration time   Scheduling
DPAC        n/s            constant               static
PFAC        n              dynamic                dynamic


IV. Experimental Results 

To evaluate the performance and scalability of the DPAC and PFAC algorithms, the input benchmarks are generated from
DEFCON [11], with sizes ranging from 256 MB to 32 GB. The string patterns are extracted from the signature strings of Snort
[13] and are grouped into three sets, as shown in Table 3. We adopt Amazon EC2 as our experimental environment. Table 4
shows the EC2 instances we chose for evaluating the PFAC and DPAC on different numbers of cores. The OpenMP [12] library
is adopted to parallelize the DPAC and PFAC algorithms to achieve optimum performance on multi-core CPUs. All
implementations are compiled using GCC 4.6.1 with the compiler flag “-O2”. The throughput is defined as the input length
divided by the elapsed time of performing string matching. The preprocessing time for creating the state transition table
and opening input files is not taken into account.
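
The throughput definition above can be written out as a short calculation. The following Python sketch is only an illustration of the formula: the function names are hypothetical, one gigabit is assumed to be 10^9 bits, and the timer is placed around the matching call alone so that preprocessing and file input are excluded, as in the measurements reported here.

```python
import time

def throughput_gbps(input_bytes, elapsed_seconds):
    """Input length divided by matching time, in gigabits per second (10^9 bits)."""
    return input_bytes * 8 / elapsed_seconds / 1e9

def measure(matcher, data):
    """Time only the matching call; 'matcher' is any callable taking the input."""
    start = time.perf_counter()
    result = matcher(data)
    elapsed = time.perf_counter() - start
    return result, throughput_gbps(len(data), elapsed)
```

For instance, matching 10^9 bytes in 8 seconds corresponds to throughput_gbps(10**9, 8.0), which is 1.0 Gbps.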


Table 3

Three pattern sets extracted from Snort

Pattern   # of rules   # of characters   # of states
Set 1     1,998        41,997            27,754
Set 2     4,414        98,611            70,284
Set 3     10,075       187,329           126,776


Table 4

Experimental machines

Instance type   CPU             Virtual cores   Memory    Memory bandwidth
M2.xlarge       Xeon® X5550     2               17.1 GB   32 GB/s
M3.xlarge       Xeon® E5-2670   4               15 GB     51.2 GB/s
M3.xlarge       Xeon® E5-2670   8               30 GB     51.2 GB/s
CC1.4xlarge     Xeon® X5570     16              23 GB     32 GB/s
CR1.8xlarge     Xeon® E5-2670   32              244 GB    51.2 GB/s


First, we evaluate the performance of the PFAC and DPAC on Amazon computing units with different numbers of cores,
namely 2, 4, 8, 16, and 32 cores. For processing 2 GB of input with pattern set 3, Fig. 5 shows that both the PFAC
and DPAC achieve performance improvement proportional to the number of cores. Furthermore, the PFAC delivers
throughput of up to 33 Gbps on 32 cores, 4.6 times faster than the DPAC. The results show that the performance of PFAC
scales up with the number of cores.



Figure 5. Performance comparison on different number of cores 


Second, we choose the high-memory cluster instance, CR1.8xlarge, to evaluate the relationship between performance, thread
number, input size, and pattern size. The CR1.8xlarge instance is equipped with two Intel® Xeon® E5-2670 CPUs, each of
which has eight cores operating at 2.6 GHz. The main memory is 244 GB with a maximum bandwidth of 51.2 GB/s.
With hyper-threading technology, the 16 physical cores can issue 32 threads simultaneously. In other words, the number of
virtual cores is 32.

To evaluate the performance with different numbers of threads, the number of threads is varied from a single thread to 512
threads. Fig. 6 shows that, for processing the 16 GB inputs, the performance of DPAC increases with the number of threads.
For processing the small pattern set (27,754 states), the DPAC achieves throughput of up to 40 Gbps. However, the
performance of DPAC decreases significantly, to less than 10 Gbps, when processing the largest pattern set (126,776 states).



Figure 6. Performance of DPAC with different number of threads 

On the other hand, the PFAC outperforms the DPAC when processing the largest pattern set. Fig. 7 shows that PFAC still
achieves throughput of up to 30 Gbps on the largest pattern set. Fig. 8 compares the performance of DPAC and PFAC for
processing 16 GB inputs and the largest pattern set (126,776 states). The PFAC delivers up to a 4-fold performance
improvement over the DPAC across different numbers of threads. In addition, both the DPAC and PFAC
algorithms saturate in performance when the number of threads exceeds 32. This result shows that the maximum
performance is dominated by the number of virtual cores.



Figure 7. Performance of PFAC with different number of threads 



Figure 8. Performance comparison of DPAC and PFAC 

Third, we evaluate the scalability of DPAC and PFAC in terms of the scale of inputs and patterns. The size of the inputs
varies from 256 MB to 32 GB. We allocate 32 threads to achieve the maximum performance. Fig. 9 shows that the DPAC has
considerable performance when processing inputs larger than 8 GB. Fig. 10 shows that the PFAC also has good performance when
processing the large-scale inputs. Fig. 11 shows that PFAC delivers a 4-fold performance improvement over DPAC on
large-scale inputs.



Figure 9. Performance of DPAC on different size of inputs 


Performance of PFAC on different sizes of inputs

Input size (MB)   256     512     1024    2048    4096    8192    16384   32768
PFAC-27754        48.18   51.41   53.67   54.26   54.85   55.10   55.36   55.27
PFAC-70284        31.55   33.34   33.90   34.35   34.45   34.85   34.63   36.12
PFAC-126776       30.92   31.92   32.64   33.17   33.36   33.22   33.26   32.67


Figure 10. Performance of PFAC on different size of inputs 

Figure 11. Performance comparison of DPAC and PFAC on different size of inputs

Finally, we evaluate the impact of thread scheduling on the performance of DPAC and PFAC. Fig. 12 shows that the DPAC
achieves better performance using static scheduling with the chunk value set to 1. Fig. 13 shows that the PFAC achieves
better performance using dynamic scheduling with a large chunk value.



Figure 12. Performance of DPAC with static and dynamic scheduling 

Figure 13. Performance of PFAC with static and dynamic scheduling


V. Conclusion 

In this paper, we have evaluated the performance and scalability of the DPAC and PFAC algorithms on multi-core processors
for processing large-scale inputs and patterns. We have also evaluated the impact of thread scheduling on the performance of
DPAC and PFAC. Experimental results show that the PFAC algorithm running on multi-core processors achieves significant
improvement in performance over the DPAC algorithm.